1940s: NLP emerged following WWII and the desire for automated language translation
1960s: Researchers split between symbolic and stochastic NLP
1970s: Further research into areas including logic-based paradigms, natural language understanding, discourse modelling
1980s: Move towards probabilistic models, information generation and extraction
1990s: Introduction of the LSTM model
2010s: Popularisation of deep learning methods including techniques such as word embeddings
2020s: Dominance of the transformer architecture
Text processing
‘Traditional’ NLP requires text preprocessing, since most machine learning models require numeric data as input.
The main stages of text preprocessing are
cleaning
tokenisation
stop word removal
stemming
Cleaning
Data cleaning will vary depending on the context but may include tasks such as
removal of punctuation
removal of digits
conversion to lower case
removal or replacement of special cases (e.g. @hollie → @user)
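A minimal cleaning sketch covering the steps above (the exact steps and their order will depend on your data; the regexes here are illustrative):

```python
import re

def clean(text):
    """Minimal cleaning sketch -- adapt the steps to your own data."""
    text = text.lower()                       # conversion to lower case
    text = re.sub(r"@\w+", "@user", text)     # replace special cases (handles)
    text = re.sub(r"\d+", "", text)           # removal of digits
    text = re.sub(r"[^\w\s@]", "", text)      # removal of punctuation
    return re.sub(r"\s+", " ", text).strip()  # tidy leftover whitespace

print(clean("Great coffee @hollie!! Visit us at 10am."))
# → "great coffee @user visit us at am"
```

Note the handle replacement runs before punctuation removal, so the `@` marker survives.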
Tokenisation
Tokenisation is how we divide text into individual tokens. Commonly, a token corresponds to a word; however, a token could also be
characters
subwords
words
n-grams
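Three of these granularities can be sketched in plain Python (subword tokenisation is omitted here because it needs a learned vocabulary, e.g. BPE or WordPiece):

```python
text = "it was the best of times"

words = text.split()                                    # word tokens
chars = list(text.replace(" ", ""))                     # character tokens
bigrams = [" ".join(p) for p in zip(words, words[1:])]  # word n-grams (n=2)

print(words)      # ['it', 'was', 'the', 'best', 'of', 'times']
print(chars[:5])  # ['i', 't', 'w', 'a', 's']
print(bigrams)    # ['it was', 'was the', 'the best', 'best of', 'of times']
```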
Stop words
Stop words can be
Global: always have low information
Subject-specific: may have low information in a given context (e.g. ‘coffee’ if considering reviews of a cafe)
Document-specific: may have low information in given documents (e.g. words appearing in a document header)
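A minimal sketch of global stop word removal using a hand-picked list (real stop lists, e.g. NLTK's English list, are much longer):

```python
# A tiny illustrative global stop list -- real lists are much longer.
stop_words = {"it", "was", "the", "of"}

tokens = "it was the best of times".split()
content = [t for t in tokens if t not in stop_words]
print(content)  # ['best', 'times']
```

Subject- and document-specific stop words would be added to the set based on the corpus at hand.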
Stemming
Stemming is the replacement of words with a stem word. For instance
walking, walked → walk
story, stories → stori
This can help to reduce the number of unique tokens. But be careful - meaning can be lost!
Lemmatisation is an alternative method that uses a morphological analysis of words to preserve meaning.
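The examples above can be reproduced with NLTK's Porter stemmer (a sketch, assuming `nltk` is installed; lemmatisation would instead use something like NLTK's `WordNetLemmatizer`, which requires a WordNet download):

```python
from nltk.stem import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()
for word in ["walking", "walked", "story", "stories"]:
    print(word, "->", stemmer.stem(word))
# walking -> walk
# walked -> walk
# story -> stori
# stories -> stori
```

Note `stori` is not a real word: the stemmer trades meaning for a smaller token set, exactly the caution raised above.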
How do we develop meaningful representations of language?
Bag of Words
A unique token is assigned to each word in the text
allows development of identity between words
all ordering is discarded
no notion of similarity
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Common words can dominate, potentially without much information, so frequency can be penalised using Term Frequency - Inverse Document Frequency (TF-IDF), allowing distinct words in a given document to have more weight.
Word embeddings
An unsupervised learning approach that learns a continuous, multidimensional vector representation for each word, for example by learning to predict a ‘centre word’ given a fixed-size window of words around it
goal is to capture meaning within the embedding
the surrounding words are used to capture meaning
common methods are Word2Vec (Google) and GloVe (Stanford)
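The ‘centre word from its window’ idea can be sketched by building the training pairs a CBOW-style Word2Vec model would learn from (a real implementation, e.g. gensim's `Word2Vec`, then fits the embeddings by predicting the centre word from its context):

```python
def cbow_pairs(tokens, window=2):
    """Build (context, centre-word) training pairs for a CBOW-style model,
    with `window` words taken either side of the centre word."""
    pairs = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, centre))
    return pairs

tokens = "it was the best of times".split()
for context, centre in cbow_pairs(tokens)[:2]:
    print(context, "->", centre)
# ['was', 'the'] -> it
# ['it', 'the', 'best'] -> was
```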
BERT
BERT’s novelty lies in the way it was pre-trained:
masked language model (MLM) – randomly mask some of the tokens from the input and predict the original vocabulary id of the masked token
next sentence prediction (NSP) – predict if the two sentences were following each other or not
These two tasks are trained concurrently.
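The MLM objective can be explored directly with a pre-trained BERT via the Hugging Face `fill-mask` pipeline (a sketch; the model is downloaded on first use and `transformers` is assumed to be installed):

```python
from transformers import pipeline

# Ask BERT to fill in a masked token, i.e. the MLM pre-training task.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
preds = unmasker("It was the best of [MASK].")

for p in preds[:3]:
    print(p["token_str"], round(p["score"], 3))
```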
BERT
jalammar.github.io/illustrated-bert
In practice
We do not need to implement (or train) any of these ourselves.
huggingface.co/models
Many pre-trained models are available, some of which are also fine-tuned for specific tasks, so we will take advantage of this and use a model from HuggingFace 🤗
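As a taste of how little code this takes, a minimal sketch with the `transformers` pipeline API (this uses the library's default sentiment model, not the workshop one; downloaded on first use):

```python
from transformers import pipeline

# With no model specified, the pipeline falls back to a default
# sentiment model fine-tuned on English text.
classifier = pipeline("sentiment-analysis")
result = classifier("I love this workshop!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```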
Discussion
Over to you
What sort of language data do you encounter in your roles and what problems/challenges do you face with this?
What regulations do you think should be placed on the use of language models (if any)?
What does the future look like?
There are post-its on the tables, please add these to the large sheets of paper on the walls
10:00
Workshop
The Twitter API saga…
Getting set up
Connect to WiFi - details at the bottom of the slide
Paste in the URL of the GitHub repository and select the notebook sentiment-analysis.ipynb
Data
There are pre-saved data sets available at the same GitHub repository. To access the URL for the raw data, navigate to the data set you wish to use and click Raw.
Data
Warning
The data set comprises actual tweets obtained using the free API. These have NOT been filtered for toxicity, profanity etc.
HuggingFace model
For our model, we will be using Twitter-roBERTa-base for Sentiment Analysis
roBERTa base model trained on ~124M tweets from January 2018 to December 2021
Fine-tuned for sentiment analysis with the TweetEval benchmark
English language
A tokeniser encodes tweets into numeric tokens that can be used by the model
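Putting this together, the workshop model can be loaded in a few lines (a sketch; the model id `cardiffnlp/twitter-roberta-base-sentiment-latest` is our assumption for the model described above, and it is downloaded on first use):

```python
from transformers import pipeline

# Assumed Hugging Face id for "Twitter-roBERTa-base for Sentiment Analysis".
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

out = classifier("What a gorgeous day to be outside!")
print(out)  # a list with one {'label': ..., 'score': ...} dict
```

The pipeline handles tokenisation internally, so raw tweet text can be passed straight in.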